Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS

نویسندگان

  • Marie Tahon
  • Raheel Qader
  • Gwénolé Lecorvé
  • Damien Lolive
چکیده

Text-to-Speech (TTS) systems rely on a grapheme-to-phoneme converter which is built to produce canonical, or statically stylized, pronunciations. Hence, the TTS quality drops when phoneme sequences generated by this converter are inconsistent with those labeled in the speech corpus on which the TTS system is built, or when a given expressivity is desired. To solve this problem, the present work aims at automatically adapting generated pronunciations to a given style by training a phoneme-to-phoneme conditional random field (CRF). Precisely, our work investigates (i) the choice of optimal features among acoustic, articulatory, phonological and linguistic ones, and (ii) the selection of a minimal data size to train the CRF. As a case study, adaptation to a TTS-dedicated speech corpus is performed. Cross-validation experiments show that small training corpora can be used without much degrading performance. Apart from improving TTS quality, these results bring interesting perspectives for more complex adaptation scenarios towards expressive speech synthesis.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving TTS with Corpus-Specific Pronunciation Adaptation

Text-to-speech (TTS) systems are built on speech corpora which are labeled with carefully checked and segmented phonemes. However, phoneme sequences generated by automatic grapheme-to-phoneme converters during synthesis are usually inconsistent with those from the corpus, thus leading to poor quality synthetic speech signals. To solve this problem, the present work aims at adapting automaticall...

متن کامل

Efficient adaptation of TTS duration model to new speakers

This paper discusses a methodology using a minimal set of sentences to adapt an existing TTS duration model to capture interspeaker variations. The assumption is that the original duration database contains information of both language-specific and speaker-specific duration characteristics. In training a duration model for a new speaker, only the speaker-specific information needs to be modeled...

متن کامل

Speaker adaptation using a parallel phone set pronunciation dictionary for Thai-English bilingual TTS

This paper develops a bilingual Thai-English TTS system from two monolingual HMM-based TTS systems. An English Nagoya HMM-based TTS system (HTS) provides correct pronunciations of English words but the voice is different from the voice in a Thai HTS system. We apply a CSMAPLR adaptation technique to make the English voice sounds more similar to the Thai voice. To overcome a phone mapping proble...

متن کامل

Implementation and test of a serious game based on minimal pairs for pronunciation training

This paper introduces the architecture and interface of a serious game intended for pronunciation training and assessment for Spanish students of English as second language. Users will confront a challenge consisting in the pronunciation of a minimal-pair word battery. Android ASR and TTS tools will prove useful in discerning three different pronunciation proficiency levels, ranging from basic ...

متن کامل

Sample-oriented Domain Adaptation for Image Classification

Image processing is a method to perform some operations on an image, in order to get an enhanced image or to extract some useful information from it. The conventional image processing algorithms cannot perform well in scenarios where the training images (source domain) that are used to learn the model have a different distribution with test images (target domain). Also, many real world applicat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016